[Target] Improve data quality for a RAG system by cleaning and structuring reference documents [Method] One-off project focusing on document processing, normalization, and metadata tagging [UI/UX] Not applicable [Stack] Python, unstructured.io, LlamaParse, PyMuPDF, Docling, Pinecone, Weaviate, Qdrant [Security] Not specified [Format] JSON

upwork.com 🟠 2026-04-24

👤 Client: GBR, member since 2026-03-09
💰 Price: $500
🚩 Problem: Ensure high-quality data for a RAG system by cleaning and structuring reference documents.
📦 Existing: Not specified

Specifications:

[Target] Improve data quality for a RAG system by cleaning and structuring reference documents
[Method] One-off project focusing on document processing, normalization, and metadata tagging
[UI/UX] Not applicable
[Stack] Python, unstructured.io, LlamaParse, PyMuPDF, Docling, Pinecone, Weaviate, Qdrant
[Security] Not specified
[Format] JSON

Workflow:

1. Assess document types (PDFs, Word, HTML) and identify any OCR issues.
2. Develop a cleaning pipeline to remove headers/footers, fix broken text, handle multi-column layouts, etc.
3. Implement structured chunking that respects the document hierarchy.
4. Extract tables while preserving their structure in markdown or JSON format.
5. Define and implement a metadata schema supporting source attribution down to section/page level.
6. Output clean, chunked, metadata-tagged data ready for vector database ingestion.
7. Review and improve the existing RAG setup focusing on embedding choice, retrieval quality, and tuning for sub-second latency.
8. Document the findings and hand the work over to the developer in a short walkthrough call.
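
As a rough illustration of steps 2, 3, 5 and 6, the sketch below strips lines that repeat across pages (likely headers and footers), splits the text at headings, and emits JSON records with source, page and section metadata. It is a minimal sketch, not the client's pipeline: it assumes each page has already been converted to Markdown-style text by one of the parsers in the stack, and the names `strip_repeated_lines`, `chunk_by_heading`, `Chunk`, the `min_fraction` threshold and the sample pages are all hypothetical.

```python
import json
import re
from collections import Counter
from dataclasses import dataclass, asdict

# Assumption: headings arrive as Markdown "#" lines after parsing.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on many pages (likely headers/footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {ln for ln, n in counts.items() if n >= threshold}
    return [
        "\n".join(ln for ln in page.splitlines() if ln.strip() not in repeated)
        for page in pages
    ]


@dataclass
class Chunk:
    text: str
    source: str         # file the chunk came from
    page: int           # page on which the chunk text starts (1-based)
    section_path: list  # heading hierarchy, outermost first


def chunk_by_heading(pages, source):
    """Split pages into one chunk per section, tracking the heading path."""
    chunks, path, buffer, start_page = [], [], [], 1

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(text, source, start_page, list(path)))
        buffer.clear()

    for page_no, page in enumerate(pages, start=1):
        for line in page.splitlines():
            match = HEADING_RE.match(line)
            if match:
                flush()  # close the previous section before opening a new one
                level = len(match.group(1))
                path = path[: level - 1] + [match.group(2).strip()]
            elif line.strip() or buffer:
                if not buffer:
                    start_page = page_no
                buffer.append(line)
    flush()
    return chunks


# Made-up sample input: three pages sharing a header and a footer line.
pages = [
    "ACME Handbook\n# Safety\nWear goggles.\nPage footer",
    "ACME Handbook\n## Lab rules\nNo food.\nPage footer",
    "ACME Handbook\n# Contact\nCall the front desk.\nPage footer",
]
chunks = chunk_by_heading(strip_repeated_lines(pages), source="handbook.pdf")
records = [asdict(c) for c in chunks]
print(json.dumps(records, indent=2))
```

Each record carries enough metadata to attribute a retrieved passage to a file, page and section. A real pipeline would likely also cap chunk size and handle tables separately before ingestion into the vector database.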
